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Abstract 



We study an abstract optimization problem arising from biomolecular sequence 
analysis. For a sequence A of pairs {aijWi) for i = l,...,n and u'i > 0, a seg- 
ment A{i,j) is a consecutive subsequence of A starting with index i and ending 
with index j. The width of A[i,j) is w{i,j) = '^i<k<jWk: and the density is 
{Yli<k<j ^k) / w{i, j) . The maximum- density segment problem takes A and two val- 
ues L and U as input and asks for a segment of A with the largest possible density 
among those of width at least L and at most U. When U is unbounded, we pro- 
vide a relatively simple, 0(n)-time algorithm, improving upon the 0(n log ij)-time 
algorithm by Lin, Jiang and Chao. When both L and U are specified, there are no 
previous nontrivial results. We solve the problem in 0{n) time if Wi = 1 for all i, 
and more generally in 0(n + nlog(?7 — L + 1)) time when Wi > 1 for all i. 

Key words: bioinformatics, sequences, density 



1 Introduction 

Non-uniformity of nucleotide composition within genomic sequences was first 
revealed through thermal melting and gradient centrifugation [Inm66,MTB76]. 
The GC content of the DNA sequences in all organisms varies from 25% 
to 75%. GC-ratios have the greatest variations among bacteria's DNA se- 
quences, while the typical GC-ratios of mammalian genomes stay in 45-50%. 
The GC content of human DNA varies widely throughout the genome, rang- 
ing between 30% and 60%. Despite intensive research effort in the past two 
decades, the underlying causes of the observed heterogeneity remain con- 
tested [Bar00,BB86,Cha94,EW92,EW93,Fil87,FO99,Hol92,Sue88,WSL89] . Re- 
searchers [NL00,SFR+99] observed that the compositional heterogeneity is 
highly correlated to the GC content of the genomic sequences. Other in- 
vestigations showed that gene length [DMG95], gene density [ZCB96], pat- 
terns of codon usage [SAL"'"95], distribution of different classes of repeti- 
tive elements [DMG95,SMRB83], number of isochores [BarOO], lengths of iso- 
chores [NLOO], and recombination rate within chromosomes [FCCOl] are all 
correlated with GC content. More research exists related to GC-rich segments 
[GGA+98,HHJ+97,IAY+96,JFN97,MHL01,MRO97,SA93,WLOG02,WSE+99]. 

Although GC-rich segments of DNA sequences are important in gene recog- 
nition and comparative genomics, only a couple of algorithms for identify- 
ing GC-rich segments appeared in the literature. A widely used window- 
based approach is based upon the GC-content statistics of a fixed-length 
window [FS90,HDV+91,NL00,RLB00]. Due to the fixed length of windows, 
these practically fast approaches are hkely to miss GC-rich segments that 
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span more than one window. Huang [Hua94] proposed an algorithm to accom- 
modate windows with variable lengths. Specifically, by assigning —p points to 
each AT-pair and 1 — p points to each GC-pair, where p is a number with 
< p < 1, Huang gave a linear-time algorithm for computing a segment of 
length no less than L whose score is maximized. As observed by Huang, how- 
ever, this approach tends to output segments that are significantly longer than 
the given L. 

In this paper, we study the following abstraction of the problem. Let A be a 
sequence of pairs (a^, Wi) for i = 1, . . . , n and > 0. A segment A{i, j) is a con- 
secutive subsequence of A starting with index i and ending with index j. The 
width of A{i,j) is w{i,j) = J2i<k<j'Wk, and the density is {J2i<k<j 0'k)/w{i, j). 
Let L and U be positive values with L < U. The maximum- density segment 
problem takes A, L, and U as input and asks for a segment of A with the 
largest possible density among those of width at least L and at most U. This 
generalizes a previously studied model, which we term the uniform model, in 
which Wi = 1 for all i. All of the previous work discussed in this section in- 
volves the uniform model. We introduce the generalized model as it might be 
used to compress a sequence A of real numbers to reduce its sequence length 
and thus its density analysis time in practice or theory. 

In its most basic form, the sequence A corresponds to the given DNA sequence, 
where = 1 if the corresponding nucleotide in the DNA sequence is G or 
C; and aj = otherwise. In the work of Huang, sequence entries took on 
values of p and 1 — p for some real number < p < 1. More generally, we 
can look for regions where a given set of patterns occur very often. In such 
applications, could be the relative frequency with which the corresponding 
DNA character appears in the given patterns. Further natural applications 
of this problem can be designed for sophisticated sequence analyses such as 
mismatch density [Sel84], ungapped local alignments [AS98], and annotated 
multiple sequence alignments [SFR+99]. 

Nekrutendo and Li [NLOO], and Rice, Longden and Bleasby [RLBOO] employed 
algorithms for the case where L = U. This case is trivially solvable in 0{n) 
time using a sliding window of the appropriate length. More generally, when 
L ^ U, this yields a trivial 0{n{U — L + l)) algorithm. Huang [Hua94] studied 
the case where U = n, i.e., there is effectively no upper bound on the width of 
the desired maximum-density segments. He observed that an optimal segment 
exists with width at most 2L — 1. Therefore, this case is equivalent to the case 
with U — 2L — 1 and thus can be solved in 0{nL) time. Recently, Lin, Jiang, 
and Chao [LJC02] gave an 0(nlogL)-time algorithm for this case based on 
the introduction of right-skew partitions of a sequence. 

In this paper, we present an 0(n)-time algorithm which solves the maximum- 
density segment problem in the absence of upper bound U . When both lower 
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and upper bounds, L and U, arc specified, we provide an 0(n)-time algorithm 
for tlie uniform case, and an 0{n + n\og{U — L + l))-time algorithm when 
Wi > 1 for all i. Our results exploit the structure of locally optimal segments to 
improve upon the 0(nlogL)-time algorithm of Lin, Jiang, and Chao [LJC02] 
and to extend the results to arbitrary values of U. The remainder of this paper 
is organized as follows. Section 2 introduces some notation and definitions. In 
Section 3, we carefully review the previous work of Lin, Jiang and Chao, in 
which they introduce the concept of right-skew partitions. Our main results 
are presented in Section 4. 

Other related works include algorithms for the problem of computing a seg- 
ment (aj, . . . aj) with a maximum sum + ■ ■ ■ + as opposed to a maximum 
density. Bentley [Ben86] gave an 0(n)-time algorithm for the case where L = 
and U — n. Within the same linear time complexity, Huang [Hua94] solved 
the case with arbitrary L yet unbounded U . More recently, Lin, Jiang, and 
Chao [LJC02] solved the case with arbitrary L and U. 



2 Notation and Preliminctries 



We consider A to be a sequence of n objects, where each object is represented 

by a pair of two real numbers (oj, Wi) for i = 1, . . . , n and tui > 0. For i < j, 
we let A{i,j) denote that segment of A which begins at index i and ends 
with index j. We let w{i,j) denote the width of A{i,j), defined as w{i,j) ~ 
T.i<k<j'^k- We let denote the density of A{i,j), defined as 



\i<k<j 

We note that the prefix sums of the input sequence can be precomputed in 
0{n) time. With these, the values of w{i,j) and t^{i,j) can be computed in 
0(1) time for any using the following formulas. 



l<k<j l<k<i-l 



^ Y 

.l<k<j l<k<i-l 



The maximum- density segment problem is to find a segment A{i,j) of max- 
imum density, subject to L < w{i,j) < U. Without loss of generality, we 
assume that Wi < U for all i, as items with larger width could not be used in 
a solution, li Wi — 1 for all i, we denote this as the uniform model. 
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1 j 

2 for i ^ n downto 1 do 

3 while {w{i,j) > U) do 

4 J ^ i - 1 

5 end while 

6 U,^3 

7 end for 

Fig. 1. Algorithm for precomputing Ui for all i. 

For a given index i, we introduce the notation Lj for the minimum index such 
that w{i,Li) > L if such an index exists, and we let Ui denote the maximum 
index such that Ui > i and w{i, Ui) < U. A direct consequence of these 
definitions is that segment A{i,j) has width satisfying L < w{i,j) < [/ if and 
only if Li is well-defined and Li < j < U^. 

In the uniform model, the set of all such values is easily calculated in 0{n) 
time, as Lj = i + L — 1 for i < n — L + 1 and U^ = min(i + U — l,n). In 
general, the full set of Lj and Ui values can be precomputed in 0{n) time by 
a simple sweep-line technique. The precomputation of the Ui values is shown 
in Figure 1; a similar technique can be used for computing values. It is not 
difiicult to verify the correctness and efiiciency of these computations. 



3 Right-Skew Segments 

For the uniform model, Lin, Jiang and Chao [LJC02] define segment A{i, k) to 
be right-skew if and only if fi{i, j) < fi{j + l,k) for all i < j < k. They define a 
partition of a sequence A into segments A1A2 . . . A^ to be a decreasingly right- 
skew partition if it is the case that each A^ is right-skew, and that iJi{Ax) > 
IJ>{Ay) for any x < y. The prove the following Lemma. 

Lemma 1 Every sequence A has a unique decreasingly right-skew partition. 

We denote this unique partition as DRSP(74). Within the proof of the above 
lemma, the authors implicitly demonstrate the following fact. 

Lemma 2 If segment A{x,y) is not right-skew, thenI)RSP{A{x,y)) is precisely 
equal to the union of A{x, k) and DRSP(A(A; + 1, y)) where A{x, k) is the longest 
possible right-skew segement begining with index x. 

Because of this structural property, the decreasingly right-skew partitions of 
all suffixes of A{l,n) can be simultaneously represented by keeping a right- 
skew pointer, p[i], for each 1 < i < n. The pointer is such that A{i,p[i]) is 
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the first right-skew segment of DRSP(A(i, n)). They imphcitly use dynamic 
programming to construct all such right-skew pointers in 0{n) time. 

In order to find a maximum-density segment of width at least L, they proceed 
by independently searching for the "good partner" of each index i. The good 
partner of i is the index i' that maximizes i') while satisfying w{i, i') > L. 
In order to find each good partner, they make use of versions of the following 
three lemmas. 

Lemma 3 (Atomic) Let B, C and D be sequences with //(-B) < //(C) < 
H{D). Then n{BC) < n{BCD). 

Lemma 4 (Bitonic) Let B be a sequence and let DRSP(C) = C1C2 ■ ■ ■ Cm for 
sequence C which immediately follows B. Let k be the greatest index i G [O.m] 
that maximizes ^{BCiC2 ■ ■ ■ Ci). Then ii{BCiC2 ■ ■ ■ Ci) > ii{BC\C2 ■ ■ ■ Q+i) 
if and only if i > k. 

Lemma 5 Given a sequence B, let C denote the shortest segment of B real- 
izing the maximum density for those segments of width at least L. Then the 
width of C is at most 2L — 1. 

Without any upper bound on the desired segment length, the consequence 
of these lemmas is an 0(logL)-time algorithm for finding a good partner for 
arbitrary index i. Since only segments of width L or greater are of interest, 
the segment A{i, Li) must be included. If considering the possible inclusion 
of further elements. Lemma 3 assures that if part of a right-skew segment 
increases the density, including that entire segment is just as helpful (in the 
application of that lemma C represents part of a right-skew segment CD). 
Therefore, the good partner for i must be Lj or else the right endpoint of one 
of the right-skew segments from T)RSP{A(Li + 1, n)). Lemma 4 shows that the 
inclusion of each successive right-skew segment leads to a bitonic sequence of 
densities, thus binary search can be used to locate the good partner. Finally, 
Lemma 5 assures that at most L right-skew segments need be considered for 
inclusion, and thus the binary search for a given i runs in O(logL) time. The 
result is an 0(nlogL)-time algorithm for arbitrary L, with U — n. 

Though presented in terms of the uniform model, the definition of a right-skew 
segment involves only the densities of segments and so it applies equally to our 
more general model. Lemmas 1-4 remain valid in the general model. A variant 
of Lemma 5 can be achieved with the additional restriction that Wi> 1 for all 
i, and thus their 0(nlogL)-time algorithm applies subject to this additional 
restriction. 
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j j' 



Fig. 2. Segments in proof of Lemma 6. 
4 Improved Algorithms 

Our techniques are built upon the use of decreasingly right-skew partitions, 
as reviewed in Section 3. Our improvements are based upon the following 
observation. An exact good partner for an index i need not be found if it can 
be determined that such a partner would result in density no greater than that 
of a segment already considered. This observation allows us to use a sweep- 
line technique to replace the 0(logL)-time binary searches used by Lin, Jiang 
and Chao [LJC02] with sequential searches that run with an amortized time 
of 0(1). In particular, we make use of the following key lemma. 

Lemma 6 For a given j , assume A{j,j') is a maximum- density segment of 
those starting with index j, having L < w{j,j') < U , and ending with index in 
a given range [x,y\. For a given i < j, assume A{i,i') is a maximum- density 
segment of those starting with index i, having L < w{i, i') < U and ending in 
range [x,y]. If i' > f, then > l^{i,i'). 



PROOF. A typical such configuration is shown in Figure 2. By assumption, 
both indices i' and / lie within the range [x, y]. Since L < w{j,j') < w{j, i') < 
w{i,i') < U, the optimality of A{j,j') guarantees that ^{j^j') > fi{j,i'). This 
implies that j') > iJ,{j,i') > + l,i'). Since L < w{j,j') < w{i,j') < 
w{i,i') < U, the optimality of A{i,i') guarantees that n{i,i') > n{i,j'), which 
in turn implies -|- > iJ^{i,i') > ii{i,j'). Combining these inequalities, 
^ A*(j) i') ^ A*(/ + l) ^ A*(^) t^^s proving the claim that iJ>{j,j') > 
H{i,i'). □ 



Our high level approach is thus to find good partners for each left endpoint 
i, considering those indices in decreasing order. However, rather than finding 
the true good partner for each i, our algorithm considers only matching in- 
dices which are less than or equal to all previously found good partners, in 
accordance with Lemma 6. In this way, as we sweep from right to left over the 
left endpoints i, we also sweep from right to left over the relevant matching 
indices. 



7 



4-1 Maximum-Density Segment with Width at Least L 

In this section, we consider the problem of finding a segment with the max- 
imum possible density among those of width at least L. We begin by intro- 
ducing a sweep-line data structure which helps manage the search for good 
partners. 



4.1.1 A Sweep- Line Data Structure 

The data structure developed in this section is designed to answer queries of 
the following type for a given range [x,y], specified upon initialization. For 
left index i, the goal is to return a matching right index i' such that 
is maximized, subject to the constraints that i' e [x,y] and that w{i,i') > L. 
No upper bound on the segment length is considered by this structure. 

In order to achieve improved efficiency, the searches are limited in the following 
two ways: 

(1) The structure can be used to find matches for many different left indices, 

however such queries must be made in decreasing order. 

(2) When asked to find the match for a left index, the structure only finds 
the true good partner in the case that the good partner has index less 
than or equal to all previously returned indices. 

Our data structure augments the right-skew pointers for a given interval with 
additional information used to speed up searches for good partners. The struc- 
ture contains the following state information, relative to given parameters 
i ^ X <y <n: 

• A (static) array, p[k] for x -\- 1 < k < y, where A{k,p[k]) is the leftmost 
segment of DRSP(A(/c, y)). 

• A (static) sorted fist, S[k], for each x -\- 1 < k < y, containing all indices j 
for which p[j] — k. 

• Two indices £ and u (for "lower" and "upper"), whose values are non- 
increasing as the algorithm progresses. 

• A variable, b (for "bridge"), which is maintained so that A{b,p[b]) is the 
segment of DRSP {A{i,y)) which contains index u. 

These data structures are initialized with procedure InitializeL(a;, y), given 
in Figure 3. An example of an initialized structure is given in Figure 4. Lines 1- 
8 of InitializeL set the values p[k] as was done in the algorithm of Lin, 
Jiang and Chao [LJC02]. Therefore, we state the following fact, proven in 
that preceding paper. 
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procedure InitializeL(,T, assumes 1 < x < y < n 

1 for i ^ y downto a; + 1 do 

2 S\i] ^ 

3 p[i] <— i 

4 while < y) and iii{i,p\i]) < iiip\i] + + 1]))) do 

5 p[i] ^ p\p[i] + 1] 

6 end while 

7 Insert i at beginning of >S'[p[i]] 

8 end for 

9 i ^ y; u ^ y; b ^ y 

Fig. 3. InitializeL operation. 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 



a; 14154543414253 



p[i] 2 7 4 6 6 7 9 9 14 11 13 13 14 

2 4 

S[i] 



5 3 8 11 12 10 

6 7 9 13 14 



Fig. 4. Example of data structure after InitializeL(l, 14), with Wi = 1 for all i. 

Lemma 7 After a call to InitializeL(a;, y), p[k] is set for allx+l <k <y 
such that A{k,p[k]) is the leftmost segment of DRSP{ A{k,y)). 

We also prove the following nesting property of decreasingly right-skew parti- 
tions. 

Lemma 8 Consider two segments A{xi, y) and A{x2-i y) with a common right 
endpoint. Let A{k,k') be a segment of DRSP{A{xi,y)) and let A{m,m') be a 
segment o/DRSP(A(a;2, y)). It cannot be the case that k < m < k' < m' . 



PROOF. If A{k, k') is a segment of DRSP(A(a;i, y)), a repeated application of 
Lemma 2 assures that A{k,k') is the leftmost segment of DRSP(>l(/c, y)) and 
that A(k,k') is the longest possible right-skew segment of those starting with 
index k. 

We assume for contradiction that k <m <k' <m' , and consider the following 
three non-empty segments, A{k,m — 1), A{m,k') and A{k' + l,m'). Since 
A{k, k') is right-skew, it must be that fi{k,m — 1) < /j{m, k'). Since A{m,m') 
is right-skew, it must be that /i(m, k') < fi{k' + 1, m'). In this case, it must be 
that the combined segment A{k,m') is right-skew (this fact can be explicitly 
proven by application of Lin, Jiang and Chao's Lemma 4 [LJC02]). Therefore 
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procedure FindMatchL(i) 

1 while {i > 1 + max(a;, Lj)) do 

2 £^£-1 

3 if {p[£] > u) then 
4 

5 end if 

6 end while 

7 while (u > i) and {fJ.{i,b — 1) > n{i,p[b])) do 
8 

9 if (m > ^) then 

10 6 minimum /c e S[u] such that /c > £ 

11 end if 

12 end while 

13 return u 

Fig. 5. FiiidMatchL(i) operation. 

the existence of right-skew segment A{k,m') contradicts the assumption that 
A{k, k') is the longest right-skew segment beginning with index k. □ 

CoroIIciry 9 There cannot exist indices k and m such that k < m < p[k] < 
p[m]. 

PROOF. A direct result of Lemmas 7-8. □ 



We introduce the main query routine, FindMatchL, given in Figure 5. 

Lemma 10 // b is the minimum value satisfying £ < b < u < then 
A{b,p^]) is the segment ofT)'KS'P{A{£,y)) which contains index u. 

PROOF. By Lemma 7, A{b,p[b]) is the leftmost segment of DRSP(A(6, |/)), 
and as b < u < p[6]) contains index u. 

By repeated application of Lemma 2, DRSP(A(£, y)) equals + 
l,p[p[£] + 1]), and so on, until reaching right endpoint y. We claim that 
A{b,p\b]) must be part of that partition. If not, there must be some other 
A{m,p[m]) with m < b < p[m\. By Lemma 8, it must be that p[m\ > 
yet then we have m < b < u < p^] < p[m\. Such an m violates the assumed 
minimality of 6. □ 

Lemma 11 Whenever line 7 o/FindMatchL() is evaluated, b is the minimum 
value satisfying i < b < u < p[b] , if such a value exists. 



1 1 decrease £ 



1 1 bitonic search 
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PROOF. We show this by induction over time. When initiahzed, i = b — 
u = p[b] = y, and thus b is the only satisfying value. The only time this 
invariant can be broken is when the value of £ or m changes. £ is changed only 
when decremented at line 2 of FindMatchL. The only possible violation of the 
invariant would be if the new index i satisfies i < u < p[£]. This is exactly the 
condition handled by hnes 3-4. 

Secondly, u is modified only at line 8 of FindMatchL. Immediately before this 
line is executed the invariant holds. At this point, we claim that p[k] < b — 1 
for any values of k such that i < k < b. For k < b, Corollary 9 implies that 
either p[k] < b or p[k] > p[b]. If it were the case that p[k] > p[b] > u this 
would violate the minimality of b assumed at line 7. Therefore, it must be 
that p[k] < 6 — 1 for all £ < /c < 6 — 1. As M is reset to 6 — 1, the only possible 
values for the new bridge b are those indices k with p[k] = u, which is precisely 
the set S[b — 1] considered at fine 10 of FindMatchL. □ 

Lemma 12 Assume FindMatchL(i) is called with a value i less than that of all 
previous invocations and such that < y. Let mo be the most recently returned 
value from FindMatchL() or y if this is the first such call. Let A{i,m) be a 
maximum- density segment of those starting with i, having width at least L, and 
ending with m e [x,y]. Then FindMatchL(i) returns the value min(m, mo). 

PROOF. All segments which start with i, having width at least L and end- 
ing with m G [x,y\ must include interval 74(i, max(a;, Lj)). The loop starting 
at line 1 ensures that variable £ = 1 + max(a;, Li) upon the loop's exit. As 
discussed in Section 3, the optimal such m must either be £ — 1 or else among 
the right endpoints of DRSP(A(£, y)). 

Since u is only set within FindMatchL, it must be that u = tuq upon entering 
the procedure. By Lemmas 10-11, A{b,p[b]) is the right-skew segment con- 
taining index u in DRSP{A{i,y)). If — 1) < n{i,p[b]), the good partner 
must have index at least p[b] > u, by Lemma 4. In this case, the while loop is 
never entered, and the procedure returns thq — min(m, mo). 

In any other case, the true good partner for i is less than or equal to mo, 
and this good partner is found by the while loop of line 1, in accordance with 
Lemmas 3-4. □ 

Lemma 13 // FindMatchL(i) returns value i' , it must be the case that for 
some j > i, segment A{j, i') is a maximum- density segment of those starting 
with j, having width at least L, and ending in [x,y\. 

PROOF. We prove this by induction over the number of previous calls to 
FindMatchL. i' = m, as defined in the statement of Lemma 12, then this claim 
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is trivially true for j = i. Otherwise, i' is equal to the same value returned by 
the previous call to FindMatchL, and by induction, there is some j > i such 
that segment A{j,i') is such a maximum- density segment. □ 

Lemma 14 The data structure supports its operations with amortized running 
times of 0{y — x + 1) for InitializeL(x, y), and 0{1) for FindMatchL(i). 

PROOF. With the exception of lines 2, 7 and 9, the initialization procedure 
is simply a restatement of the algorithm given by Lin, Jiang and Chao [LJC02] 
for constructing the right-skew pointers. An 0{y—x+l)-time worst-case bound 
was proven by those authors. 

In analyzing the cost of FindMatchL we note that variables i and u are ini- 
tialized to value y at line 9 of InitializeL. Variable £ is modified only when 
decremented at line 2 of FindMatchL and remains at least x + 1 due to the 
condition at line 1. Therefore, the loop of lines 1-6 executes at most y — x + 1 
times and this cost can be amortized against the initialization cost. Variable u 
is modified only at line 8. By Lemma 11, x < £ < b < u < p[b], and so this line 
results in a strict decrease in the value of u yet u remains at least x. Therefore, 
the while loop of lines 7-12 executes 0{y — x+1) times. The only step within 
that loop which cannot be bounded by 0(1) in the worst case is that of line 10. 
However, since each k appears in list S[u] for a distinct value of u, the overall 
cost associated with fine 10 is bounded by 0{y — x + 1). Therefore the cost 
of this while loop can be amortized as well against the initializaiton cost. An 
0(1) amortized cost per call can account for all remaining instructions outside 
of the loops. □ 

4.1.2 An 0{n)-time Algorithm 

In Figure 6, we present a linear-time algorithm for the maximum-density seg- 
ment problem subject only to a lower bound of L on the segment width. The 
algorithm makes use of the data structure developed in Section 4.1.1. 

Theorem 15 Given a sequence A, the algorithm MaximumDensitySegmentL 
finds the maximum- density segment of those with width at least L. 

PROOF. To prove the correctness, assume that ji is the density of an optimal 
such segment. First, we note that for any value i, Lemma 12 assures that g[i] 
is set such that g[i\ > Lj. Therefore, //(A;, g[k]) < jl for all k for which g[k\ was 
defined. 

We claim that for some fc, value g[k] is set such that ii{k,g[k]) > fi. Assume 
that the maximum density fi is achieved by some segment A{i,i'). Since it 
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procedure MaximumDensitySegmentL(A, L) 



1 [calculate partial sums, Lj, as discussed in Section 2] 

2 call InitializeL(l, n) to create data stucture 

3 io <— maximum index such that Ljg is well-defined 

4 for i ^ io downto 1 do 

5 if {Li = y) then / / only one feasible right index 

6 g[i] ^ y 

7 else 

8 g[i] FindMatchL(i) 

9 end if 

10 end for 

11 return {k,g[k]) which maximizes iJ,{k,g[k]) for I < k < io 



Fig. 6. Algorithm for finding maximum-density segment with width at least L 

must be that Lj is well-defined, we consider the pass of the loop starting at 
hne 4 for such an i. Lemma 13 assures us that if FindMatchL is called, it either 
returns i' or else it must be the case that for some j > i, g[j] < i' In this case, 
as U is unbounded. Lemma 6 assures us that n{j,g[j]) > fJ.{i,i') = fi. And 
thus MajximumDensitySegmentL returns a segment with density /t. □ 

Theorem 16 Given a sequence A of length n, MaximumDensitySegmentL 

runs in 0{n) time. 

PROOF. This is a direct consequence of Lemma 14. □ 

4..2 Maximum- Density Segment with Width at Least L and at Most U 

In this section, we consider the problem of finding a segment with the maxi- 
mum possible density among those of width at least L and at most U . At first 
glance, the sweeping of variable u in the previous algorithm appears similar 
to placing an explicit upper bound on the width of the segments of interest 
for a given left index i. In locating the good partner for i, a sequential search 
is performed over right-skew segments of DRSP(A(Lj + l,n)). The repeated 
decision of whether it is advantageous to include the bridge segment A{h,p\l)\) 
is determined in accordance with the bitonic property of Lemma 4. 

The reason that this technique does not immediately apply to the case with 
an explicit upper bound of U is the following. If the right endpoint of the 
bridge, is strictly greater than f/j, considering the effect of including the 
entire bridge may not be relevant. To properly apply Lemmas 3-4, we must 
consider segments of DRSP(A(Lj -|- 1, Ui)) as opposed to DRSP(A(Li -|- 1, n)). 
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procedure InitializeU(a;, y) 
for i ^ X + 1 to y do 

^ i 

while {{q[i] > x) and (//(^[^[i] 
q[i] 4~ q[q\i] - 1] 



l],q\i]-l)<Mi\,i))) do 



assumes 1 < x < y < n 



end while 



end for 



Fig. 7. InitializeU operation. 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 




14154543414253 



q[i] 



2 3 3 3 3 3 8 8 10 10 12 10 10 



Fig. 8. Example of data strueture after InitializeU(l, 14), with Wi = 1 for all i. 
4-2.1 Another Sweep-Line Data Structure 

Recall that the structure of Section 4.1.1 focused on finding segments begin- 
ning with i, ending in [x, y] and subject to a lower bound on the resulting 
segment width. Therefore, as i was decreased, the effective lower bound, Lj, 
on the matching endpoint can only decrease. The decomposition of interest 
was DRSP(/l(Lj, y)), and such decompositions were simultaneously represented 
for all possible values of Lj by the right-skew pointers, p[k]. 

In this section, we develop another sweep-line data structure that we used 
to locate segments beginning with i, ending in [x,y] and subject to an upper 
bound on the resulting segment width (but with no explicit lower bound) . For 
a given i, the decomposition of interest is DRSP(74(a: -|- 1, [/,)). However, since 
Ui decreases with f, our new structure is based on representing the decreasingly 
right-skew partitions for all prefixes of A{x +l,y), rather than all suffixes. We 
assign values q[k] ioi x + 1 < k < y such that yl(g[A;],A;) is the rightmost 
segment of T!lK2'P{A{x -\- l,k)). Though there are clear symmetries between 
this section and Section 4.1.1, there is not a perfect symmetry; in fact the 
structure introduced in this section is considerably simpler. The lack of perfect 
symmetry is because the concept of right-skem segments, used in both sections, 
is oriented. 

The initialization routine for this new structure is presented in Figure 7. An 

example of an initialized structure is given in Figure 8. The redesign of the 
initialization routine relies on a simple duality when compared with the cor- 
responding routine of Section 4.1.1. One can easily verify that an execution 
of this routine on a segment A{x, y) sets the values of array q precisely as 
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procedure FindMatchU(i) 
while {u > Ui) do 



/ / decrease u 



u ^ u — 1 
end while 

while {u > x) and {ijL{i,q[u\ — 1) > iJi{i,u)) do 



/ / bitonic search 



u ^ q[u] — 1 
end while 
return u 



Fig. 9. FiiidMatchU(i) operation. 



the original version would set the values of array p if run on a reversed and 
negated copy of A{x,y). Based on this relationship, we claim the following 
dual of Lemma 7 without further proof. 

Lemma 17 Immediately after InitializeU(a;, |/), the segment A{q[k],k) is 
the rightmost segment of DRSP{A{x + 1, k)), for all k in the range [x + l,y]. 

We now present the main query routine, FindMatchU, given in Figure 9, and 
discuss its behavior. 

Lemma 18 Assume FindMatchU(i) is called with a value i less than that of 
all previous invocations and such that x < Ui < y. Let mo be the most recently 
returned value from FindMatchU() or y if this is the first such call. Let A{i, m) 
he a maximum- density segment of those starting with i, having width at most 
U , and ending with m e [x,mo]. Then FindMatchU(i) returns the value m. 



PROOF. Combining the constraints that w{i,m) < U and that m e [a;,mo], 
it must be that m < min(C/j, mo). When entering the procedure, the variable u 
has value mo. The loop starting at line 1 ensures that variable u — mm{Ui, mo) 
upon the loop's exit. The discussion in Section 3 assures us that the op- 
timal m G [x, u] must either be x or else among the right endpoints of 
DRSP(74(a; + l,u)). Based on Lemma 17, A{q[u],u) is the rightmost segment 
of DRSP(74(a; + 1,m)) and so the loop condition at line 4 of FindMatchU is a 
direct application of Lemma 4. □ 

Lemma 19 The data structure supports its operations with amortized running 
times of 0{y — X + 1) /or InitializeU(x, y), andO{l) /or FindMatchU(i), so 
long as Ui > X for all i. 



PROOF. The initialization procedure has an 0{y — x + l)-time worst-case 
bound, as was the case for the similar routine in Section 4.1.1. 
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To account for the cost of FindMatchU, wc note that u is initiahzed to value 
y at hne 7 of Initial izeU. It is only modified by lines 2 and 5 of the routine, 
and we claim that both lines strictly decrease the value. This is obvious for 
line 2, and for line 5, it follows for FindMatchU, since q[u] <uva. accordance 
with Lemma 17. We also claim that u is never set less than x. Within the loop 
of lines 1-3, this is due to the assumption that Ui > x. For the loop of lines 4- 
7, it is true because q[u\ > a; + 1 in accordance with Lemma 17. Therefore, 
these loops execute at most 0(y — a; + 1) times combined and this cost can be 
amortized against the initialization cost. An 0(1) amortized cost per call can 
account for checking the initial test condition before entering either loop. □ 



4-2.2 An 0{n) -time Algorithm for the Uniform Model 

In this section, we present a linear-time algorithm for the uniform maximum- 
density segment problem subject to both a lower bound L and an upper bound 
U on the segment width, with L < U. 

Our strategy is as follows. We pre-process the original sequence by breaking it 
into blocks of cardinality exactly U — L (except, possibly for the last block). 
For each such block, we maintain two sweep-line data structures, one as in 
Section 4.1.1 and one as in Section 4.2.1. 

For a given left index i, a valid good partner must lie in the range [Lj, Ui]. 
Because we consider the uniform model, such an interval had cardinality pre- 
cisely {U — L + 1) and thus overlaps exactly two of the pre-processed blocks. 
For a = {U — L) \Li/ {U — L)] , we search for a potential partner in the range 
[Li, a] using the data structure of Section 4.1.1, and for a potential partner in 
the range [a + 1, Ui] using the data structure of Section 4.2.1. Though we are 
not assured of finding the true good partner for each i, we again find the global 
optimum, in accordance with Lemma 6. Our complete algorithm is given in 
Figure 10. 

Theorem 20 Given a sequence A of length n for which Wi — 1 for all i, 
and parameters L < U , the algorithm MaximumDensitySegmentLU finds the 
maximum- density segment of those with width at least L and at most U . 



PROOF. Because Wj = 1 for all i, the interval has cardinality pre- 

cisely [U — L We note that z is chosen at hne 12 so that Lj hes within 
Block£,[2;] and that C/j lies within 'Rlocku[z -\- 1]. 

To prove the correctness, assume that fi is the density of an optimal such 
segment. First, we claim that L < w{i,g[i]) < U for any i and thus that the 
algorithm cannot possibly return a density greater than ji. This is a direct 
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procedure MaximumDensitySegmentLU(A, L, U) 



1 [calculate values and t/j, as discussed in Section 2] 

2 leftend <— 1 

3 while {leftend < n) do // initialize blocks 

4 rightend <— min(n, leftend + (U — L) — 1) 

5 z ^ {leftend -I)/ {U - L) 

6 Block£,[z] ^ lnitia.lizeL{le f tend, rightend) 

7 Blocku[z] <— lnitia.lize\J{lef tend, rightend) 

8 leftend ^ leftend + {U - L) 

9 end while 

10 i^o ^ maximum index such that Ljp is well-defined 

1 1 for i io downto 1 do 

12 2; <— \Li/ (U — L)~\ 1 1 determine which blocks to search 

13 gj}^ <— FindMatchL(i) for Blockifz] 

14 gu\^ ^ FindMatchU(i) for B\oz^u\z + 1] 

15 if {\x{i, gi\^ > n{i, gu[i])) then 

16 g[i] ^ gili] 

17 else 

18 g[i] ^ gu[i] 

19 end if 

20 end for 

21 return {k,g[k]) which maximizes fi{k,g[k]) for 1 < A: < io 



Fig. 10. Algorithm for finding maximum-density segment with width at least L and 

at most U 

result of Lemma 12 in regard to the block searched at line 13 and Lemma 18 
in regard to the block searched at line 14. 

Wc then claim that for some k, value g[k] is set such that ii{k,g[k]) > fi. 
Assume that the maximum density fi is achieved by some segment A{i,i'). 
Since it must be that Lj is well-defined, we consider the pass of the loop 
starting at fine 11 for such an i. We show that either g[i] = i' or else there 
exists some j > i such that iJ,{j,g[j]) > ii{i,i'). If i' lies within Blocks [2;] 
then we apply the same reasoning about the call to FindMatchL at line 13, 
as we did in the proof of Theorem 15. If i' lies within Block[/[2; + 1] then we 
consider the behavior of the call to FindMatchU(i) at line 14. If that call does 
not return i' then it must be that i' > mo as defined in Lemma 18. Since the 
return values of this method are non-increasing, we let j be the largest index 
for which the returned g[j] < i'. At the onset of that call to FindMatchU it 
must have been that tuq > i' and therefore we can apply Lemma 6 to deduce 
that iJ,{j,g[j]) > iJ,{i,i'). Thus MaximumDensitySegmentLU returns a segment 
with density p,. □ 

Theorem 21 Given a sequence A of length n for which Wi — 1 for all i, the 
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algorithm McLximumDensitySegmentLU runs in 0{n) time. 



PROOF. This is a consequence of Lemmas 14 and 19. The calls to the initial- 
ization routines in the loop of lines 3-9 runs in 0{n) time, as each initialization 
routine is called for a set of blocks which partition the original sequence. Sim- 
ilarly, the cost of all the calls to FindMatchL and FindMatchU in the loop of 
lines 11-20 can be amortized against the corresponding initializations. □ 



4-2.3 An 0{n + n log(C/ — L + l))-time Algorithm for a More General Model 

With general values of Wi, the algorithm in Section 4.2.2 is insufficient for 
one of two reasons. If values oi Wi > 1 are allowed, it may be that an entire 
interval [Li, Ui] falls in a single block, in which case neither of the sweep- line 
data structures suffice. Alternatively, if values of Wj < 1 are allowed, then an 
interval [Li, Ui] might span an arbitrary number of blocks. Always searching all 
such blocks might result in a;(n) overall calls to FindMatchL or FindMatchU. 

In this section, we partially address the general case, providing an 0{n + 
nlog{U — L + l))-time algorithm when Wi> 1 for all i. This condition assures 
us that the interval [Li, Ui] has cardinality at most {U — L + 1). Rather than 
rely on a single partition of the original sequence into fixed-sized blocks, we 
will create 0{\og{U — L + 1)) partitions, each of which uses fixed-sized blocks, 
for varying sizes. Then we show that the interval [Lj, Ui\ can be covered with 
a collection of smaller blocks in which we can search. 

For ease of notation, we assume, without loss of generality, that the overall 
sequence A is padded with values so that it has length n which is a power 
of two. We consider n blocks of size 1, n/2 blocks of size 2, n/4 blocks of 

size 4, and so on until n/2'^ blocks of size 2^, where [3 = [log2(?7 — L + 
Specifically, we define block Bj^k = A{1 + j * 2^ (j + 1) * 2^) for al\0<k<p 
and < j < We begin with the following lemma. 

Lemma 22 For any interval A{p, q) with cardinality at most U — L + 1, we 
can compute, in 0(1 + (3) time, a collection of 0{1 + (5) disjoint blocks such 
that A{p, q) equals the union of the blocks. 

The algorithm CollectBlocks is given in Figure 11, and a sample result is 
shown in Figure 12. It is not difficult to verify the claim. 

Theorem 23 Given a sequence A of length n for which Wi > 1 for all i, a 
maximum- density segment of those with width at least L and at most U can 
be found in 0{n -\- n \og{U — L time. 
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procedure CollectBlocks(p, q) 



1 s ^ p; k ^ 0; 

2 while (2*= + s - 1 < g) do 

3 while (2*^+1 + s - 1 < q) do 

4 + 1 

5 end while 

6 j ^ rs/2'^1 - 1 

7 Add block Bj^k to the collection 

8 k^k+1 

9 end while 

10 while (s < q) do 

11 while (2'' + s - 1 > g) do 

12 k^k-1 

13 end while 

14 j ^ \s/2''] - 1 

15 Add block Bj^^ to the collection 

16 k^k-1 

17 end while 



Fig. 11. Algorithm for finding collection of blocks to cover an interval 



:rin 



Fig. 12. A cohection of blocks for a given interval 

PROOF. The algorithm is given in Figure 13. First, we discuss the correct- 
ness. Assume that the global maximum is achieved by A(i,i'). We must sim- 
ply show that this pair, or one with equal density, was considered at line 14. 
By Lemma 22, i' must lie in some Bj^k returned by CollectBlocks(Lj, ?7j). 
Because iJ.{i,i') is a global maximum, the combination of Lemmas 6 and 12 
assures us that FindMatchLO at line 10 will return i' or else some earlier pair 
which was found has density at least as great. 

We conclude by showing that the running time is 0{n + nf3). Notice that 
for a fixed /c, blocks Bj^^ < j < n/2'^ — 1 comprise a partition of the 
original input A, and thus the sum of their cardinalities is 0{n). Thefefore, 
for a fixed /c, lines 4-6 run in 0{n) time by Lemma 14, and overall lines 3-7 
run in 0{nj3) time. Each call to CollectBlocks from line 9 runs in 0(1 + (3) 
time by Lemma 22, and produces 0(1 + j3) blocks. Therefore, the body of 
that loop, lines 9-14, executes 0{n + n(3) times over the course of the entire 
algorithm. 
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procedure MaximumDensitySegmentLU2(A, L, U) 



1 [calculate values Lj and f/j, as discussed in Section 2] 

2 (5 ^ \}-Og2{U - L + 1)J ; ii^ax < oo; 

3 for A; ^ to /3 do 

4 for j ^ to n/2^ - 1 do 

5 Bj^k ^ InitializeL(l + j * 2^, (j + 1) * 2^) 

6 end for 

7 end for 

8 io ^ maximum index such that Lj is defined 

9 foreach Bj^k in CollectBlocks(Lj, C/j) do 

10 temp ^ FindMatchL(z) 

11 if ji{i,temp) > Umax then 

12 IJ'max ^ n{i,temp); record cndpoints {i,temp) 

13 end if 

14 end foreach 

15 end for 



Fig. 13. Algorithm for maximum-density segment with width at least L, at most U 

Finally, we must account for the time spent in all calls to FindMatchL from 
line 10. Rather than analyze these costs chronologically, we account for these 
calls by considering each block Bjk over the course of the algorithm. By 
Lemma 14, each of these calls has an amortized cost of 0(1), where that 
cost is amortized over the initialization cost for that block. □ 
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